An extensible approach to high-quality multilingual typesetting
Authors
Abstract
We propose to create and study a new model for the micro-typography part of automated multilingual typesetting. This new model will support quality typesetting for a number of modern and ancient scripts. The major innovations in the proposal are that the process is refined into four phases and that each phase depends on a multidimensional, tree-structured context summarizing the current linguistic and cultural environment. The four phases are: preparing the input stream for typesetting; segmenting the stream into clusters (words); typesetting these clusters; and recombining the clusters into a typeset text stream. The context is pervasive throughout the process: the algorithms used in each phase are context-dependent, as are the meanings of fundamental entities such as language, script, font and character.
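The four-phase, context-driven process described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the nested-dictionary context, the `lookup` helper, the `Box` type, the context keys (`input:normalization`, `language:word_separator`, `font:advance`, `font:space`, `script:direction`) and the fixed-advance width calculation are all hypothetical, chosen only to show how each phase might consult a tree-structured context.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A typeset cluster: its text and a (hypothetical) advance width."""
    text: str
    width: float


def lookup(context, path, default=None):
    """Walk a nested-dict context along a path such as 'script:direction'."""
    node = context
    for key in path.split(":"):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def prepare(stream, context):
    """Phase 1: prepare the input stream (here, Unicode normalization only)."""
    form = lookup(context, "input:normalization", "NFC")
    return unicodedata.normalize(form, stream)


def segment(stream, context):
    """Phase 2: segment the stream into clusters (words)."""
    separator = lookup(context, "language:word_separator")  # None means whitespace
    return stream.split(separator)


def typeset_cluster(cluster, context):
    """Phase 3: typeset one cluster, assuming a fixed advance per character."""
    advance = lookup(context, "font:advance", 1.0)
    return Box(cluster, advance * len(cluster))


def recombine(boxes, context):
    """Phase 4: recombine the boxes into a line and compute its total width."""
    space = lookup(context, "font:space", 1.0)
    if lookup(context, "script:direction", "ltr") == "rtl":
        boxes = list(reversed(boxes))
    total = sum(box.width for box in boxes) + space * max(len(boxes) - 1, 0)
    return boxes, total


def typeset(stream, context):
    """Run the four phases in order, passing the same context to each."""
    prepared = prepare(stream, context)
    clusters = segment(prepared, context)
    boxes = [typeset_cluster(cluster, context) for cluster in clusters]
    return recombine(boxes, context)
```

Calling `typeset("high quality type", {"font": {"advance": 0.5, "space": 0.25}})` would run all four phases with the same context; changing `script:direction` to `"rtl"` alters only the recombination phase, which is the kind of context dependence the abstract describes.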
Similar resources
Text analysis and language identification for polyglot text-to-speech synthesis
In multilingual countries, text-to-speech synthesis systems often have to deal with texts containing inclusions of multiple other languages in the form of phrases, words, or even parts of words. In such multilingual cultural settings, listeners expect a high-quality text-to-speech synthesis system to read such texts in a way that the origin of the inclusions is heard, i.e., with correct language-sp...
Mineral composition and geothermobarometry of Mata basaltic rocks (South Kerman): An indicator of magma type and tectonic setting
Basaltic volcanic rocks with pillow structures are exposed at the southeasternmost extremity of the Sanandaj-Sirjan zone. They belong to volcanic-sedimentary complexes that formed in a general northwest-southeast to north-south trend. In this research, the mineral chemistry of clinopyroxene and plagioclase is employed to study physicochemical conditions and paleo-tectonic setting during genera...
Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites
In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English–Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manual...
UIDS: A Multilingual Document Summarization Framework Based on Summary Diversity and Hierarchical Topics
In this paper, we put forward UIDS, a new high-performing extensible framework for extractive MultiLingual Document Summarization. Our approach treats a document in a multilingual corpus as an item sequence set, in which each sentence is an item sequence and each item is the minimal semantic unit. Then we formalize the extractive summary as a summary diversity sampling problem that considers to...
Active Learning for Multilingual Statistical Machine Translation
Statistical machine translation (SMT) models require bilingual corpora for training, and these corpora are often multilingual with parallel text in multiple languages simultaneously. We introduce an active learning task of adding a new language to an existing multilingual set of parallel text and constructing high quality MT systems, from each language in the collection into this new target lan...